Temporal Distribution Based Software Cache Partition To Reduce I-cache Misses
Authors
Abstract
As multimedia applications on mobile devices become more computationally demanding, embedded processors with a single-level I-cache have become more prevalent, typically with a combined I-cache and SRAM of 32 KB to 48 KB in total. Code size reduction alone is no longer adequate for such applications, since program sizes are much larger than the SRAM and I-cache combined. For such systems, a 3% I-cache miss rate can easily translate into more than 50% performance degradation, so code layout that minimizes I-cache misses is essential to recover the lost cycles. In this paper, we propose a new code layout algorithm, temporal distribution based software cache partition, with a focus on multimedia code for mobile devices. The algorithm is built on top of Open64's [14] code reordering scheme. By characterizing code according to its temporal reference distribution, we partition the code and map the partitions to logically different regions of the cache. Both capacity and conflict misses are significantly reduced, and the cache is used more effectively. The algorithm has been implemented as part of the tool-chain for our products. We compare our results with previous work and show that our approach is more effective at reducing I-cache misses, especially for applications suffering from capacity misses.

1. Observation and Motivation

As multi-core and multi-threading are employed in embedded processors, instruction fetch efficiency becomes even more important to total system performance, and instruction cache performance becomes one of the most critical factors influencing the entire system design. For example, in our video codec processor, a cache miss rate of 3% causes as much as 50% performance degradation for the H.264 encoder, the most computationally demanding video encoding standard today. Traditional code layout algorithms use both basic blocks and procedures as the units for code positioning.
Sometimes, new "procedures" are generated by procedure splitting before further positioning. In this work, we always split each procedure into two or more sections, called "code blocks", which serve as the unit for layout. The runtime record of the code blocks' execution sequence, which we call the "cb-trace", is used to analyze their temporal characteristics for further code layout. By analyzing the runtime temporal characteristics of the code blocks, we observed two kinds: code blocks that are uniformly distributed along the cb-trace, and code blocks whose distribution exhibits a large skew. For example, consider the following temporal sequence, where each letter represents one code block:

ABCDEF(UV)ABCDEF(PQ)ABCDEF(XY)ABCDEF

In this sequence, A, B, C, D, E, and F have a uniform distribution, while U, V, P, Q, X, and Y do not. Code blocks exhibiting a large skew in their reference distribution have good temporal locality [13]. They are usually good candidates for traditional code layout algorithms, which focus on reducing I-cache conflict misses, because such blocks can be placed to avoid this kind of miss very effectively. Considering the interleaved relationship between the pairs U and V, P and Q, and X and Y, we need only two cache lines to hold these six code blocks, assuming each has the size of one cache line. We call them code blocks with good temporal locality. On the other hand, code blocks exhibiting a uniform reference distribution have little or no exploitable temporal locality, for the following reasons:

1. They generally interleave pervasively with other code blocks and will cause many misses when sharing cache lines with them. For example, we need at least six cache lines to hold A, B, C, D, E, and F without cache misses, since they are interleaved with all the other code blocks. To avoid cache misses, we practically have to let them hold cache lines exclusively.

2. They often have relatively long reuse distances and hence are prone to capacity misses. Because traditional code layout algorithms are more effective against conflict misses, not much work has been done for programs whose code footprint exceeds the cache capacity. Since this kind of code block has a temporally regular pattern, we call them code blocks with good temporal regularity.

From the above example, it can be seen that different kinds of code blocks need different layout policies. Code blocks with good temporal locality can share cache lines and still incur no extra cache misses; code blocks with good temporal regularity should hold cache lines exclusively as much as possible to avoid cache misses. However, traditional code layout algorithms do not distinguish between these two kinds of code blocks. When the cache capacity is large enough, a proper layout can be obtained with a traditional algorithm. Suppose we have more than eight cache lines; a traditional code layout algorithm (e.g. a TRG based algorithm) can generate a placement that avoids all conflict misses, as shown below:

A B C D E F U/P/X V/Q/Y

However, when there is not enough cache capacity, say only six cache lines, different layouts generate different numbers of cache misses; e.g. M1 incurs 24 misses while M2 incurs only 18, as shown below:

M1: A/E B/F C D U/P/X V/Q/Y
M2: A B C D U/P/X/E V/Q/Y/F

The key point here is to prevent code blocks with good temporal regularity from sharing cache lines among themselves as much as possible (since that incurs cache line thrashing too easily), and to let them hold cache lines exclusively, or share cache lines only with code blocks with good temporal locality if needed. For example, when A and E share the same cache line, both A and E suffer a cache miss each time they are referenced.
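The miss counts quoted for M1 and M2 can be reproduced with a small trace replay. The sketch below is not part of the paper's tool-chain; it models each layout as a block-to-cache-line mapping and assumes every code block fills exactly one line of a direct-mapped cache:

```python
def count_misses(trace, mapping):
    """Count misses when each block in `trace` is fetched through the
    cache line assigned to it by `mapping` (block -> line index)."""
    resident = {}  # cache line -> block it currently holds
    misses = 0
    for block in trace:
        line = mapping[block]
        if resident.get(line) != block:  # line holds a different block: miss
            resident[line] = block
            misses += 1
    return misses

# The example cb-trace: ABCDEF(UV)ABCDEF(PQ)ABCDEF(XY)ABCDEF
trace = list("ABCDEFUV" "ABCDEFPQ" "ABCDEFXY" "ABCDEF")

# M1: A/E and B/F share lines; U/P/X and V/Q/Y share lines (6 lines total)
m1 = {"A": 0, "E": 0, "B": 1, "F": 1, "C": 2, "D": 3,
      "U": 4, "P": 4, "X": 4, "V": 5, "Q": 5, "Y": 5}

# M2: A, B, C, D hold lines exclusively; E joins U/P/X and F joins V/Q/Y
m2 = {"A": 0, "B": 1, "C": 2, "D": 3,
      "E": 4, "U": 4, "P": 4, "X": 4, "F": 5, "V": 5, "Q": 5, "Y": 5}

print(count_misses(trace, m1))  # 24
print(count_misses(trace, m2))  # 18
```

The replay confirms the intuition: in M1 the uniformly distributed blocks A/E and B/F thrash each other on every pass, while in M2 only E and F pay extra misses because their line-mates U, P, X and V, Q, Y are referenced in a single burst each.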
However, when E shares a cache line with U, P, and X, only references to E incur extra cache misses, thanks to the good temporal locality of U, P, and X. Based on these observations, we devised a temporal distribution based software cache partition algorithm for code layout. First, we characterize the code blocks by their temporal distribution and classify them as having good temporal locality or good temporal regularity. Second, we partition the cache into two regions to hold these two types of code blocks.

2. Solutions and Methodology

Like traditional code layout algorithms, our partition based algorithm is heuristic. Since the code placement policy depends heavily on program characteristics, it is important to design an adaptive algorithm. Since our processor targets multimedia, we focus our design and evaluation on multimedia applications only. Our layout process includes five steps: 1) code block formation, including basic-block level (bb-level) reordering and procedure splitting, 2) execution and cb-trace generation, 3) cb-trace analysis and temporal distribution calculation, 4) iterative partitioning of the cache and the code blocks, and 5) layout and placement generation. Our solutions are summarized as follows:

1. Characterize the temporal distribution of a cb-trace in terms of temporal regularity and locality. We do this by statistical analysis of positions in cb-traces.

2. Since we want code blocks with good temporal regularity to hold cache lines exclusively, we select only those whose cache misses are critical to overall performance. Good candidates have the following characteristics: a) they are hot code blocks that directly affect application performance; b) they are "dense" code blocks, that is, code blocks with few branches. We use instruction density (the dynamic instruction count of a code block divided by its size) to evaluate denseness, which benefits the spatial usage of cache lines.

3. Because different code sections inside one procedure may exhibit different characteristics, hotness, and density, we perform bb-level reordering and procedure splitting before code layout. This improves the uniformity of the hotness and density of the generated code blocks. We then generate a distinct placement for these split code blocks. The algorithm and implementation of bb-level reordering and procedure splitting are based on Open64 and are discussed in detail in Section 3.1.

4. To make the partition applicable to different program characteristics, we developed an iterative cache partition algorithm, which makes the approach more flexible and easier to use.

5. Finally, the TRG algorithm (explained in detail in Section 3.2) is used to place code blocks inside each partitioned cache region.

The rest of the paper is organized as follows. Section 3 reviews related work. Section 4 gives the equations for calculating the temporal distribution of code blocks and classifying them. Section 5 describes the iterative partition and layout algorithm. Sections 6 and 7 evaluate the code layout algorithm on typical embedded multimedia applications, including four video programs and one audio program. We conclude in Section 8.
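As an illustration of the position-based classification described above, the sketch below separates the blocks of the earlier example trace by how widely their references spread along the cb-trace. The `spread` metric and the 0.5 threshold are illustrative assumptions for this example, not the paper's actual equations (the paper derives those in its Section 4):

```python
def spread(trace, block):
    """Normalized span of `block`'s positions in the cb-trace: near 1 for
    blocks referenced uniformly across the whole run, near 0 for blocks
    whose references cluster in one phase of execution."""
    pos = [i for i, b in enumerate(trace) if b == block]
    return (max(pos) - min(pos)) / (len(trace) - 1)

def classify(trace, threshold=0.5):
    """Split blocks into a temporal-regularity class (wide spread) and a
    temporal-locality class (clustered references)."""
    blocks = sorted(set(trace))
    regular = [b for b in blocks if spread(trace, b) >= threshold]
    local = [b for b in blocks if spread(trace, b) < threshold]
    return regular, local

# The example cb-trace: ABCDEF(UV)ABCDEF(PQ)ABCDEF(XY)ABCDEF
trace = list("ABCDEFUV" "ABCDEFPQ" "ABCDEFXY" "ABCDEF")
regular, local = classify(trace)
print(regular)  # ['A', 'B', 'C', 'D', 'E', 'F'] -> hold lines exclusively
print(local)    # ['P', 'Q', 'U', 'V', 'X', 'Y'] -> may share lines
```

In a real tool, such a classifier would run on the full cb-trace from step 2 and be combined with the hotness and instruction-density filters of solution 2 before the two cache regions are sized.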
Similar Articles
MiniTasking: Improving Cache Performance for Multiple Query Workloads
This paper proposes a novel idea, called MiniTasking, to reduce the number of cache misses by improving data temporal locality for multiple concurrent queries. Our idea is based on the observation that, in many workloads such as decision support systems (DSS), there is usually a significant amount of data sharing among different concurrent queries. MiniTasking exploits such data sharing charac...
Achieving high performance in bus-based shared-memory multiprocessors
In bus-based SMPs, cache misses and bus traffic pose key obstacles to high performance. To overcome these problems, several techniques have been proposed. Cache prefetching, read snarfing, software-controlled updating, and cache injection reduce cache misses; migrate-on-dirty, adaptive migratory detection, load-exclusive instruction, and exclusive prefetching reduce invalidation bus traffic.
Code Reordering for Multi-level Cache Hierarchies
As the gap between memory and processor performance continues to grow, it becomes increasingly important to exploit cache memory effectively. Both hardware and software techniques can be used to better utilize the cache. Many software solutions produce new program layouts to better utilize the available memory and cache address space. In this paper we present a new link-time code reordering alg...
Temporal-Based Procedure Reordering for Improved Instruction Cache Performance
As the gap between memory and processor performance continues to grow, it becomes increasingly important to exploit cache memory effectively. Both hardware and software techniques can be used to better utilize the cache. Hardware solutions focus on organization, while most software solutions investigate how best to lay out a program on the available memory space. In this paper we present a new l...
Reduction in Cache Memory Power Consumption based on Replacement Quantity
Today power consumption is considered to be one of the important issues. Therefore, its reduction plays a considerable role in developing systems. Previous studies have shown that approximately 50% of total power consumption is used in cache memories. There is a direct relationship between power consumption and replacement quantity made in cache. The less the number of replacements is, the less...
Publication date: 2008